ABSTRACT

Motivation: Exome sequencing (exome-seq) data, which are typically 
used for calling exonic mutations, have also been utilized in detecting 
DNA copy number variations (CNVs). Despite the existence of several 
CNV detection tools, there is still a great need for a sensitive and
accurate CNV-calling algorithm that has built-in QC steps and does
not require a paired reference for each sample.

Results: We developed a novel method named PatternCNV, which (i) 
accounts for the read coverage variations between exons while lever- 
aging the consistencies of this variability across different samples; (ii) 
reduces alignment BAM files to WIG format and therefore greatly 
accelerates computation; (iii) incorporates multiple QC measures 
designed to identify outlier samples and batch effects; and (iv) pro- 
vides a variety of visualization options including chromosome, gene 
and exon-level views of CNVs, along with a tabular summarization of 
the exon-level CNVs. Compared with other CNV-calling algorithms 
using data from a lymphoma exome-seq study, PatternCNV has 
higher sensitivity and specificity. 

Availability and implementation: The software for PatternCNV is im- 
plemented using Perl and R, and can be used in Mac or Linux envir- 
onments. Software and user manual are available at http:// 
bioinformaticstools.mayo.edu/research/patterncnv/, and R package 
at https://github.com/topsoil/patternCNV/. 
Contact: Asmann.Yan@mayo.edu 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 

Received on September 11, 2013; revised on April 22, 2014; accepted
on May 22, 2014 

1 INTRODUCTION 

DNA copy number variations (CNVs) are genomic structural 
changes that result in regional or chromosomal loss or gain of 
DNA copies (Hastings et al., 2009). Owing to their significant roles
in human diseases, various laboratory techniques have been
developed to detect CNVs, including recently advanced massively
parallel sequencing of whole genomes and coding exomes. For
exome-seq, it is commonly observed that the coverage depth of
short reads varies across regions because of differences in target
capture efficiency (Parla et al., 2011) and in the mappability of
exons. Such coverage variations impose substantial challenges for
reliable CNV detection. Most existing methods
use a paired-sample approach, based on the intuitive assumption
that a somatic sample and its paired reference share similar
coverage bias that can be cancelled out through pairing (Koboldt
et al., 2012; Sathirapongsasuti et al., 2011). Although this
assumption approximately holds, it oversimplifies the problem and
leaves two limitations unaddressed: (i) the region-specific noise
(coverage variability) of a local region is not accounted for,
leading to amplified noise in the log-ratio values of coverage
between the sample and its paired reference; and (ii) when the
reference sample is missing or of low quality, CNV detection based
on a paired reference is infeasible or has degraded
accuracy/sensitivity.
A recently published method, FishingCNV, tried to address the
second limitation by using the average of multiple reference
samples as the denominator in the log-ratio calculation, but did
not address the regional noise in individual samples (the
numerator), which led to false CNV calls (details in Supplementary
Section S2.3). Considering these issues, we proposed a novel
method called PatternCNV, which summarizes overall consistent 
patterns of both depths and variability of exonic region coverage 
across samples, where 'patterns' of coverage and variability are 
summarized using multiple 'normal' or reference samples. We
observed that the same patterns exist only between samples
prepared using the same version of the exome capture kit. During CNV
detection, we compute the differences of observed coverage ver- 
sus the common pattern, while penalizing regions associated with 
larger variability using a weighting scheme. Further, whole- 
genome CNV can be interpolated from exon-level CNV using 
any third-party segmentation method, e.g. circular binary
segmentation (Olshen et al., 2004).

PatternCNV was implemented in two versions: a Mac and
Linux/Unix version, and an R package version. We also developed a
conversion tool to transform Binary Sequence Alignment/Map (BAM)
format files into much smaller wiggle (WIG) format files (<1% of
BAM file size), which greatly
speeds up pattern learning and CNV calculation. When com- 
pared with other state-of-the-art CNV algorithms in a lymphoma 
case study, PatternCNV displayed higher resolution and greater 
sensitivity/specificity. 



2 FEATURES 

2.1 Input, output and major functions 

PatternCNV is divided into three major functional components: 
(i) BAM-to-WIG conversion for improved computational
performance: a BAM2WIG converter using SAMtools (Li et al.,
2009) and BEDtools (Quinlan and Hall, 2010), which takes as
input a BAM file, a file of Browser Extensible Data (BED) 
format defining exon regions and a second BED file for capture 
targets defined by the exome capture kit. The outputs are WIG 
files with greatly reduced file sizes compared with BAM files; (ii) 
CNV detection: starting with WIG files, PatternCNV estimates 
the coverage and variability patterns from multiple reference 
samples and calculates CNVs relative to the pattern for all sam- 
ples including the references; and (iii) CNV summary and visu- 
alization: this module outputs a detailed exon-level CNV 
summary file per sample, and provides several visualization
options for viewing CNVs at the whole-genome level or chromosome
level. In addition, there are built-in QA/QC steps to detect
sample outliers and batch effects. Figure 1 displays the overall
workflow of PatternCNV along with illustrative examples of
program output.

Fig. 1. PatternCNV workflow is demonstrated in the upper panel.
Examples of whole-genome and chromosome-level visualization are
displayed in the bottom panel, along with an exon-level CNV
summary table
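
To illustrate why a binned WIG representation is so much smaller than
a BAM file, the sketch below averages per-base read depth over
fixed-size bins and formats the result as a fixedStep WIG block. It is
a minimal sketch and not the BAM2WIG converter itself: the function
name depth_to_wig, the use of the bin mean, and the dropping of a
trailing partial bin are assumptions made for illustration, and the
per-base depths are assumed to have been extracted from the BAM file
beforehand.

```python
from typing import List


def depth_to_wig(chrom: str, start: int, depths: List[int], bin_size: int = 10) -> str:
    """Format per-base depths as a fixedStep WIG block of bin averages.

    `start` is the 1-based coordinate of the first base in `depths`.
    Hypothetical helper for illustration only.
    """
    header = f"fixedStep chrom={chrom} start={start} step={bin_size} span={bin_size}"
    lines = [header]
    # Emit full bins only; a trailing partial bin is dropped (assumption),
    # because every value in a fixedStep block shares the same span.
    for offset in range(0, len(depths) - bin_size + 1, bin_size):
        window = depths[offset:offset + bin_size]
        lines.append(f"{sum(window) / bin_size:.2f}")
    return "\n".join(lines)


if __name__ == "__main__":
    # 23 bases of depth for one exon starting at chr1:1001
    example = [10] * 10 + [20] * 10 + [5] * 3
    print(depth_to_wig("chr1", 1001, example, bin_size=10))
```

Each bin is stored as a single number rather than as the individual
alignments that cover it, which is where the reduction in file size
comes from.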

2.2 Description of the PatternCNV algorithm 

Each exon is first divided into consecutive bins of user-defined
size (e.g. 10 base pairs). To make the exon coverage of different
samples comparable, log2-transformed RPKM (reads per kilobase
per million total reads) is used to standardize the bin coverage.
Denoting $x_i$ as the log2-transformed RPKM coverage of the $i$-th
bin in a given exon, the standard coverage of a bin without CNV
is assumed to approximately follow a normal distribution,
$x_i \sim N(\mu_i, \sigma_i^2)$, where $\mu_i$ and $\sigma_i$ are estimated
from a pool of reference samples as the coverage and variability
patterns. For a bin with a copy number of $C$, the bin signal is
calculated as $r = \log_2(C/2)$, with $x_i \sim N(r + \mu_i, \sigma_i^2)$.
Hence, a bin-level CNV can be estimated as $\hat{r}_i = x_i - \mu_i$.
Considering that the variability of bin coverage depends on its
relative position in an exon or with respect to the capture probe,
we further smooth multiple bins within the $k$-th exon (we denote
the related bin indices as $i \in E_k$), leading to a maximum
likelihood estimation: $\hat{r}_k = \sum_{i \in E_k} w_i (x_i - \mu_i)$,
where $w_i$ is designed to take the variability of each bin into
consideration (details of the statistical formulation are described
in Supplementary Section S1).
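
The estimation steps above can be sketched in a few lines of Python.
This is an illustrative sketch under stated assumptions, not the
PatternCNV implementation: the function names, the pseudocount and
the floor on the standard deviation are hypothetical, and the weights
are taken to be normalized inverse variances, which is the maximum
likelihood solution for independent normal bins with known variances
but may differ from the weighting scheme in Supplementary Section S1.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def log2_rpkm(read_count: float, bin_length_bp: int, total_reads: float,
              pseudocount: float = 0.5) -> float:
    """log2 RPKM of one bin; the pseudocount (assumption) avoids log2(0)."""
    rpkm = (read_count + pseudocount) * 1e9 / (bin_length_bp * total_reads)
    return math.log2(rpkm)


def learn_pattern(reference_bins: Sequence[Sequence[float]],
                  min_sd: float = 1e-3) -> Tuple[List[float], List[float]]:
    """Estimate per-bin mean (coverage pattern) and standard deviation
    (variability pattern) from reference samples.

    reference_bins[s][i] is the log2 RPKM of bin i in reference sample s.
    min_sd is a hypothetical floor that keeps the weights finite.
    """
    per_bin = list(zip(*reference_bins))
    mu = [mean(values) for values in per_bin]
    sigma = [max(stdev(values), min_sd) for values in per_bin]
    return mu, sigma


def exon_cnv(x: Sequence[float], mu: Sequence[float], sigma: Sequence[float]) -> float:
    """Exon-level log2 ratio: weighted mean of the bin-level estimates
    x_i - mu_i, with normalized inverse-variance weights (assumption)."""
    inv_var = [1.0 / (s * s) for s in sigma]
    total = sum(inv_var)
    return sum(w * (xi - mi) for w, xi, mi in zip(inv_var, x, mu)) / total


if __name__ == "__main__":
    # Three reference samples, two bins in one exon (toy values)
    refs = [[5.0, 6.0], [5.2, 6.5], [4.8, 5.5]]
    mu, sigma = learn_pattern(refs)
    r_hat = exon_cnv([6.0, 6.0], mu, sigma)
    print(f"log2 ratio = {r_hat:.3f}, copy number = {2 * 2 ** r_hat:.2f}")
```

Because $r = \log_2(C/2)$, an estimated log2 ratio converts back to a
copy number as $C = 2 \cdot 2^{r}$. In the toy example the first bin is
less variable across the references, so it dominates the exon-level
estimate.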

2.3 Lymphoma case study 

We applied PatternCNV to a set of 15 germline-tumor pairs of
diffuse large B-cell lymphoma exome-seq data (Lohr et al., 2012).
When comparing CNV results derived from exome-seq using 
PatternCNV with those calculated from SNP microarray data 
profiled on the same samples, the two sets of results largely cor- 
relate for large CNVs. As expected, PatternCNV identified many 
small CNV regions at the single exon and/or multiple exon level 
(Supplementary Section S2.3) that the SNP array failed to detect 
owing to a lack of probe coverage/density at those regions. In addition,
thanks to the digital dynamic range of read coverage,
PatternCNV can differentiate high versus low amplifications, 
while microarrays are limited by the saturation of probe hybrid- 
ization signal. We compared PatternCNV with three other 
exome-seq-based CNV detection methods, ExomeCNV 
(Sathirapongsasuti et al., 2011), Varscan2 (Koboldt et al.,
2012) and FishingCNV (Shi and Majewski, 2013), using CNVs
detected by SNP microarrays as the ground truth. PatternCNV
displayed superior visual resolution and achieved better specificity
and sensitivity when compared with the paired approaches used
by ExomeCNV and Varscan2 (Supplementary Section S2.2), and
had far fewer false positives than FishingCNV
(Supplementary Section S2.3). In several focused comparisons,
we also saw an increased resolution of PatternCNV-based
estimations compared with the two paired methods (Supplementary
Section S2.1). In situations where a reference sample had less 
reliable quality than its paired counterpart, we often observed 
dramatically reduced performance of both Varscan2 and 
ExomeCNV for CNV detection, but not PatternCNV 
(Supplementary Section S2.1). This highlights the robustness of 
the pattern-based approach over conventional paired
approaches. FishingCNV uses a method of taking the average
across normal samples, which is more similar to PatternCNV
than the paired methods used by the other two tools. However, a
detailed comparison shows that FishingCNV has different data 
processing and CNV detection methods (Supplementary Section 
S2.3). FishingCNV's principal component analysis (PCA) step
overcorrects batch effects and consequently removes CNV
signals, resulting in false-negative calls. We recommend that
users do not perform the default PCA step of FishingCNV.
Moreover, its average read-depth approach is oversimplified,
producing an alarmingly high number of false-positive CNV
calls (Supplementary Section S2.3). In contrast, PatternCNV's
novel use of both the weighted average read depth and coverage
variability produces results that are superior and simpler to use,
improving true positives and greatly reducing false-positive
CNV calls.
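
For readers who wish to run a similar comparison, the sketch below
shows one way to score exon-level calls against a ground truth such
as SNP microarray results. It is a hedged illustration only: the
function name, the encoding of calls as -1 (loss), 0 (neutral) and
+1 (gain), and the rule that a call in the wrong direction counts as
a false negative are assumptions, and may differ from the scoring
used in the Supplementary Sections.

```python
from typing import Dict, Tuple


def sensitivity_specificity(calls: Dict[str, int],
                            truth: Dict[str, int]) -> Tuple[float, float]:
    """Score exon-level CNV calls against a ground truth.

    Both dictionaries map exon IDs to -1 (loss), 0 (neutral) or +1 (gain).
    Only exons present in both dictionaries are scored.
    """
    tp = fn = tn = fp = 0
    for exon in calls.keys() & truth.keys():
        called, actual = calls[exon], truth[exon]
        if actual != 0:
            # Altered exon: correct only if called in the same direction
            if called == actual:
                tp += 1
            else:
                fn += 1
        elif called == 0:
            tn += 1
        else:
            fp += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    truth = {"e1": 1, "e2": -1, "e3": 0, "e4": 0, "e5": 1}
    calls = {"e1": 1, "e2": 0, "e3": 0, "e4": 1, "e5": 1}
    sens, spec = sensitivity_specificity(calls, truth)
    print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```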

3 DISCUSSIONS AND CONCLUSIONS

We introduce PatternCNV, a software package focused on
exon-level CNV detection from exome-seq data. The CNV
estimate is based on coverage and variability patterns summarized
from multiple reference samples. The implemented algorithm 
uses WIG file format, which improves the runtime and space 
efficiency. Several post-processing functions are included to fa- 
cilitate interpretation, through visualization, segmentation and 
tabular summarization. As demonstrated by the case study, we 
believe it is a useful utility for exome-seq studies where robust 
detection of germline and/or somatic CNVs is of interest.